Approximate All-Pairs Suffix/Prefix Overlaps
Abstract
Finding approximate overlaps is the first phase of many sequence assembly methods. Given a set of r strings of total length n and an error rate ε, the goal is to find, for all pairs of strings, their suffix/prefix matches (overlaps) that are within edit distance k = ⌈εℓ⌉, where ℓ is the length of the overlap. We propose new solutions for this problem based on backward backtracking (Lam et al. 2008) and suffix filters (Kärkkäinen and Na, 2008). The techniques use nH_k + o(n log σ) + r log r bits of space, where H_k is the k-th order entropy and σ is the alphabet size. In practice, the methods are easy to parallelize and scale up to millions of DNA reads.
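To make the problem statement concrete, here is a minimal brute-force sketch for a single ordered pair of strings. It is not the backward-backtracking or suffix-filter method of the paper; it is a plain quadratic dynamic program, and it assumes that ℓ is measured as the length of the matched prefix of the second string. The function name approx_overlaps and the min_len parameter are hypothetical, introduced only for illustration.

```python
from fractions import Fraction
from math import ceil


def approx_overlaps(a, b, eps, min_len=1):
    """Return (l, d) pairs where the prefix b[:l] matches some suffix of a
    with edit distance d <= ceil(eps * l).

    Assumption: the overlap length l is the length of the prefix of b.
    """
    # Parse eps via its decimal string to avoid floating-point rounding in ceil().
    eps = Fraction(str(eps))
    n = len(b)
    # Row 0: an empty part of a against b[:j] costs j insertions.
    prev = list(range(n + 1))
    for i in range(1, len(a) + 1):
        # Column 0 stays 0: the matched suffix of a may start at any position.
        cur = [0] * (n + 1)
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost,  # match or substitution
                         prev[j] + 1,         # skip a character of a
                         cur[j - 1] + 1)      # skip a character of b
        prev = cur
    # The last row holds the best distance between b[:l] and a suffix of a.
    return [(l, prev[l]) for l in range(min_len, n + 1)
            if prev[l] <= ceil(eps * l)]
```

For example, approx_overlaps("GATTACA", "TACAGGG", 0, min_len=3) returns [(4, 0)], the exact overlap "TACA". Running this for every ordered pair costs time quadratic in the read lengths per pair, which is the kind of cost that index-based methods are meant to avoid.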
Similar resources
Improved Filters for the Approximate Suffix-Prefix Overlap Problem
Computing suffix-prefix overlaps for a large collection of strings is a fundamental building block for the analysis of genomic next-generation sequencing data. The approximate suffix-prefix overlap problem is to find all pairs of strings from a given set such that a prefix of one string is similar to a suffix of the other. Välimäki et al. (Information and Computation, 2012) gave a solution to t...
Two Efficient Techniques to Find Approximate Overlaps between Sequences
The next-generation sequencing (NGS) technology outputs a huge number of sequences (reads) that require further processing. After applying prefiltering techniques in order to eliminate redundancy and to correct erroneous reads, an overlap-based assembler typically finds the longest exact suffix-prefix match between each ordered pair of the input reads. However, another trend has been evolving f...
Suffix Trees and Suffix Arrays
Iowa State University. 1.1 Basic Definitions and Properties; 1.2 Linear Time Construction Algorithms (Suffix Trees vs. Suffix Arrays; Linear Time Construction of Suffix Trees; Linear Time Construction of Suffix Arrays; Space Issues); 1.3 Applications ...
Algorithm Engineering for All-Pairs Suffix-Prefix Matching
All-pairs suffix-prefix matching is an important part of DNA sequence assembly where it is the most time-consuming part of the whole assembly. Although there are algorithms for all-pairs suffix-prefix matching which are optimal in the asymptotic time complexity, they are slower than SOF and Readjoiner which are state-of-the-art algorithms used in practice. In this paper we present an algorithm ...
A Practical and Scalable Tool to Find Overlaps between Sequences
The evolution of the next generation sequencing technology increases the demand for efficient solutions, in terms of space and time, for several bioinformatics problems. This paper presents a practical and easy-to-implement solution for one of these problems, namely, the all-pairs suffix-prefix problem, using a compact prefix tree. The paper demonstrates an efficient construction of this time-e...
Publication date: 2010